Network properties determine neural network performance

Machine learning influences numerous aspects of modern society, powers new technologies from AlphaGo to ChatGPT, and increasingly materializes in consumer products such as smartphones and self-driving cars. Despite the vital role and broad applications of artificial neural networks, we lack systematic approaches, such as network science, to understand their underlying mechanisms. The difficulty is rooted in the many possible model configurations, each with different hyper-parameters and weighted architectures determined by noisy data. We bridge the gap by developing a mathematical framework that maps the neural network's performance to the network characteristics of the line graph governed by the edge dynamics of stochastic gradient descent differential equations. This framework enables us to derive a neural capacitance metric that universally captures a model's generalization capability on a downstream task and predicts model performance using only early training results. Numerical results on 17 pre-trained ImageNet models across five benchmark datasets and one NAS benchmark indicate that our neural capacitance metric is a powerful indicator for model selection based only on early training results and is more efficient than state-of-the-art methods.

The main contribution of this work is to use network statistics, called the network capacitance \beta_{eff}, to predict the validation loss curve and/or final performance of the neural network being trained.
Overall I think the paper has a lot of confusing parts.
First, the definition of \beta_{eff} and how it is used is quite confusing. Is \beta_{eff} a static quality that is only related to the topology of the network? If so, why does Theorem 1 hold (G_A converges and thus \beta_{eff} is zero)? If that's not the case, it is better to clearly specify which network quantities the adjacency matrix P depends on (e.g., network weights, backpropagated gradients, data labels, etc.), to make things clearer.
From its definition, $\beta_{eff}$ seems to be a scalar (i.e., a single number) that changes over time, but it was used to predict the validation accuracy $I$ (also changing over time) with a regularized linear regression (page 6 at front), and leads to very strong performance (Figs. 2-4). This seems to be too good to be true. I am not convinced that the scalar \beta_{eff}, summarized from a complicated neural network, contains sufficient information to predict the accuracy I. Also, I don't know how we can apply "Bayesian ridge regression" on a scalar (1-D) input. Can the authors elaborate on how it is done? The section about "Bayesian ridge regression" is too general to be useful. Page 5 mentions the three-layer neural capacitance probe (NCP) unit, which is supposed to predict the performance of neural networks, yet it has nothing to do with \beta_{eff}? I am really confused about how the prediction works.
Second, the experiments are also inadequate. A lot of ablation studies are missing, as pointed out in the following: 1. What happens if the training converges but overfits (and according to Theorem 1, \beta_eff = 0), but the validation accuracy is high? Does \beta_eff still give strong performance? The authors should do studies at different training regions, otherwise the conclusion is misleading.

Data and methodology
Overall, the metric proposed in this paper is novel. I suggest the following to further strengthen the paper: • The authors should give some toy example(s) to illustrate how exactly their proposed metric gets computed step by step. Currently, Fig. 1 only delivers the high-level idea, so more details are needed.
• Although it's left for future work, I think it is necessary to use standard benchmarks for neural architecture search (NAS), e.g., NASBench201, for results reproducibility and easier comparison with other approaches.
• GPU-hour spent for estimating the accuracy is necessary for comparison with existing works. This should include the time spent on model warmup epochs (training epochs before the starting epoch t_0), data collection epochs, and metric computation.
• Comparison with more related work is necessary. For example, reference [6] above provides a comprehensive survey of existing works on zero-shot NAS, which is referred to as TM in the paper.

Analytical approach
The reason why a high network resilience may indicate a high validation accuracy is not clearly specified. Also, the proposed metric is only proved to indicate that the network reaches the optimal point. It would be better if the authors showed some theoretical results w.r.t. the convergence rate before reaching the optimum. Finally, establishing the relationship between the generalization capacity and the proposed metric could also make this paper even stronger.

Suggested improvements
• Use publicly available NAS benchmarks for evaluation, particularly large datasets like ImageNet, Places365, and provide direct comparisons with existing approaches (SOTA).
• Provide results on more diverse CNNs. Currently, the only difference among the selected CNNs is their depth and basic blocks. The authors should consider more diverse CNNs, e.g., Wide-ResNet.
• Report GPU-hours as an estimation of the computational cost.
• Provide more details on the relafionship between resilience and accuracy.
• [Minor] Can the authors show some results beyond vision, such as language tasks?

Clarity and context
Comparison with related works is necessary. Also, the authors need to improve the clarity of the proposed method. In particular, the construction of the metric (Eq. 2) shows that the metric is determined by the adjacency matrix P. As shown in Eq. 1, g_ij is the weight for the edge, while the adjacency matrix P only contains 0 or 1. However, this is not the case in Eq. 2, as the authors claim that the graph is a weighted graph.
AR: We would like to clarify that β eff is not a static quality, but rather a scalar metric derived from both the topology and the weights of the network. Although the topology is static, the weights change during the training process, leading to changes in β eff.
More specifically, we aimed to study the training dynamics over the weights of the neural network G A. To facilitate our study with well-developed techniques in network science, we modeled the training dynamics with a directed network G B, where each node corresponds to a weight in G A, and each edge corresponds to the interaction between a pair of weights in G A. If the nodes' states in G B reach equilibrium, it indicates that the weights of G A have become stable and may have reached the optimum. The adjacency matrix P is a "weighted" matrix over G B, where each entry in P is associated with an edge of G B and describes the interaction strength between a pair of weights in G A. The reviewer is correct that this matrix P depends on the network weights, gradients, data labels, etc. We have added more descriptions in the revised manuscript.
We add the following description to the revised manuscript (first line in Section "Property of the neural capacitance"): "According to Eq. (7), we have the weighted adjacency matrix P of G B in place. The matrix P encodes rich information of the network, such as the topology, the weights, the gradients, and the training labels indirectly."
RC: From its definition, β eff seems to be a scalar (i.e., a single number) that changes over time, but it was used to predict the validation accuracy I (also changing over time) with a regularized linear regression (page 6 at front), and leads to very strong performance (Figs. 2-4). This seems to be too good to be true. I am not convinced that the scalar β eff, summarized from a complicated neural network, contains sufficient information to predict the accuracy I. Also, I don't know how we can apply "Bayesian ridge regression" on a scalar (1-D) input. Can the authors elaborate on how it is done? The section about "Bayesian ridge regression" is too general to be useful.
AR: The reviewer's interpretation of β eff is correct. β eff is a 1D metric that measures the performance of a neural network. It is a highly compressed metric that includes the topology, weights, and gradients of the neural network. As the training progresses, both β eff and the validation accuracy I change over time. β eff approaches zero as the training converges. This is one novel discovery of this work.
It is not our intention to use β eff to completely reflect all the impacts during training. For example, the learning rate, the optimizer, and many other hyper-parameters can affect β eff values. We are leveraging the validation accuracy collected from the early training epochs to capture these missing impacts, and we formulate the implicit relation between the validation accuracy and the proposed β eff. Our approach aligns with existing learning curve prediction approaches, which seek to learn a non-linear predictor of the validation accuracy. Their prediction accuracy increases if they observe for a long enough period of time, with more data points from the training trajectory. In practice, our β eff can also be used together with other metrics for performance prediction. However, our approach can achieve accurate prediction even with very few early training epochs. As the reviewer stated, "This seems to be too good to be true." We had the same feeling when we first saw these results. We confirmed these surprisingly good results when we proved Theorem 1 and conducted extensive simulations to validate it.
Regarding the use of Bayesian ridge regression, we described in Fig. 2 (row 3) of the main manuscript how to use it to estimate the relation between β eff and I with a few observations from the training trajectory. We would like to emphasize that Bayesian ridge regression is applied to the observed pairs of training β eff and validation accuracy, rather than to a 1D scalar input.
For clarification, we added some extra notation to the section "Bayesian ridge regression" in Methods. We can summarize the application of Bayesian ridge regression in our framework as follows: • Inputs: {(β eff,k, I k) | k = 1, 2, ..., K} is a set of observations, where β eff,k is the proposed metric calculated from the training set, I k is the validation accuracy, and K is the total number of observations collected from the early stage of model training.

[Figure: schematic of the observation window on the validation accuracy curve, spanning from the start epoch to the final epoch.]
These observations are used to fit the relation between the training β eff,k values and the validation accuracies I k, and then to extrapolate the final accuracy as β eff → 0, which we expect to hold once the model has converged. Note that some of the very early epochs may be noisy and are discarded from the regression. The number of observations and the start epoch of the observation window are determined by the Bayesian information criterion (BIC). See the last sentence of subsection "Neural network model selection with the neural capacitance β eff" in the main manuscript for details.
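To make this concrete, below is a minimal sketch of the fitting-and-extrapolation step, assuming the pairs (β eff,k, I k) have already been collected; the polynomial feature expansion and all names are illustrative assumptions, not the released implementation.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Pairs collected over the observation window (illustrative numbers, not real results):
# beta_k are per-epoch neural capacitance values, I_k the matching validation accuracies.
beta_k = np.array([0.92, 0.71, 0.55, 0.41, 0.33, 0.27])
I_k    = np.array([0.41, 0.52, 0.60, 0.66, 0.70, 0.73])

# Fit the implicit relation I = h(beta_eff; theta) with a regularized Bayesian linear model.
# A polynomial expansion of the scalar beta_eff is one simple choice of features.
degree = 3
X = np.vander(beta_k, N=degree + 1, increasing=True)[:, 1:]   # columns: beta, beta^2, beta^3

reg = BayesianRidge().fit(X, I_k)

# Theorem 1: beta_eff -> 0 at convergence, so the predicted final accuracy is h(0; theta),
# obtained by evaluating the fitted model at beta_eff = 0 (all polynomial features vanish).
I_final, I_std = reg.predict(np.zeros((1, degree)), return_std=True)
print(f"predicted final accuracy: {I_final[0]:.3f} +/- {I_std[0]:.3f}")
```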
RC: Page 5 mentions the three-layer neural capacitance probe (NCP) unit, which is supposed to predict the performance of neural networks, yet it has nothing to do with β eff? I am really confused about how the prediction works.

AR:
The NCP unit is placed on top of the neural network G A to replace the original output layer. One of the main purposes of the NCP unit is to calculate a surrogate β eff, because calculating β eff for the entire network with all weights in G A is computationally prohibitive. We are transferring the knowledge learned from ImageNet to new datasets. The bottom layers of the neural network represent low-level features of the images, and the NCP unit captures the high-level, determinant features for classification. Because of this role, we calculate the partial β eff, and it is still able to predict the performance of the entire network.
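As an illustration of this setup, here is a minimal Keras sketch in which the original output layer is replaced by a small frozen probe; the backbone choice, layer sizes, and dropout rate are illustrative assumptions, not the exact NCP configuration.

```python
import tensorflow as tf

num_classes = 10  # e.g., CIFAR-10 as the target dataset

# Pre-trained bottom layers F_s^(1): an ImageNet backbone with its output layer removed.
backbone = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                           pooling="avg", input_shape=(224, 224, 3))

# Neural capacitance probe (NCP) unit U: a small stack of layers replacing the original
# output layer.  It is randomly initialized and frozen; beta_eff is computed only over
# the interactions of this probe's weights, as a surrogate for the full network.
ncp = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
], name="ncp")
ncp.trainable = False  # the probe stays frozen; only the backbone is fine-tuned

model = tf.keras.Sequential([backbone, ncp])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) on the target dataset then yields the early-epoch observations.
```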
We added a revision to the revised manuscript (last paragraph of Section "Property of the neural capacitance"): "Because of this, we seek to derive a surrogate from a partial of G A. Specifically, we insert a neural capacitance probe (NCP) unit . . ."
RC: Second, the experiments are also inadequate. A lot of ablation studies are missing, as pointed out in the following:
AR: We appreciate the reviewer's insightful reviews and valuable suggestions regarding ablation studies. In response, we have incorporated supplementary experiments and expanded our descriptions in the revised manuscript to enhance clarity. We are confident that these changes contribute significantly to improving the paper.
RC: 1. What happens if the training converges but overfits (and according to Theorem 1, β eff = 0), but the validation accuracy is high? Does β eff still give strong performance? The authors should do studies at different training regions, otherwise the conclusion is misleading.
AR: There may be a misunderstanding. The proposed approach does not intend to predict the capacity of the neural network without considering the impacts of various factors (e.g., hyper-parameters) on the final performance of the model. Overfitting happens when the model learns the training data too well (and may only memorize the seen examples without essentially learning the knowledge from them), yet is unable to generalize to new data. This phenomenon becomes apparent when the training accuracy continues to rise while the testing accuracy starts to decline. In the presence of overfitting, the testing/validation accuracy is typically lower than the training accuracy.
The degree of overfitting and the overall performance of a model are influenced by a variety of factors, and properties related to β eff may also be distorted without addressing overfitting. Given that overfitting is primarily induced by model complexity, and in our case both the retrained model and the NCP remain fixed, the model's complexity remains unchanged. However, since the dropout rate in the NCP is typically employed to mitigate overfitting, we attempted to reproduce an overfitting scenario by reducing the dropout rate from 0.4 to 0.1. Regrettably, we did not observe overfitting under these conditions; as shown in Fig. 2, none of the models exhibits overfitting. One of the contributing factors is that the pre-trained models may have been well trained and reside in a favorable position within the optimal loss basin.
In addition to applying different dropout rates, we also examined the effect of different batch sizes. The result is shown in Fig. 3. We did not observe overfitting with different batch sizes either.
RC: 2. If β eff contains only the training information, there always exists a validation set so that the prediction fails. Therefore, the prediction accuracy should critically depend on some sort of distance between the training and the validation distribution. Such a study is also missing.

AR:
We are grateful for the reviewer's insightful feedback. The metric β eff is built on the adjacency matrix P, which encodes the topology information, the training information, the weights of the network, as well as the gradient information. Since the weights are updated during training to reduce the discrepancy between the labels and the predictions, we admit that β eff also indirectly contains the training label information. We agree with the reviewer that there always exists a validation set that fails the prediction, and that the prediction accuracy of our approach depends on the distance between the training and validation distributions. We would like to emphasize that the underlying assumption of the proposed approach aligns with the implicit assumption of all supervised learning approaches, i.e., that the training and validation data are drawn from the same distribution. If the data is skewed, it will be difficult to rely on the validation accuracy to select the best-trained model or to make reliable predictions. We use β eff from the training set to predict the validation accuracy, which essentially bridges the distribution shift. More epochs of observations are shown to improve the prediction quality.
RC: 3. β eff seems to also contain the training labels info, in addition to the weights of the network (otherwise there is no way to know when the training converges). In that case, a baseline would be just to use the training accuracy (as a time series) to predict the validation accuracy I.
AR: We thank the reviewer for suggesting this baseline, which is exactly one of our baseline training curve predictors. We have two learning curve predictor baselines, BSV and LSV, to predict the final validation accuracy. Instead of using the training accuracy, both baselines use the validation accuracy values collected during training to predict the final validation accuracy.
BSV uses the best validation accuracy seen so far, and LSV uses the last validation accuracy seen as the prediction of the final accuracy.
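A minimal sketch of these two heuristics (function names are illustrative):

```python
def bsv(val_acc_history):
    # BSV: the best validation accuracy seen so far is used as the final-accuracy prediction.
    return max(val_acc_history)

def lsv(val_acc_history):
    # LSV: the last validation accuracy seen is used as the final-accuracy prediction.
    return val_acc_history[-1]

early_curve = [0.41, 0.52, 0.63, 0.58, 0.61]   # partial learning curve (illustrative)
print(bsv(early_curve), lsv(early_curve))      # 0.63 0.61
```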
RC: 4. What are the other baselines shown in Table 1? What are BSV and LSV? Are there any references, and how are they computed? If they are computed based on the network weights only (but not the labels in the training set), then it is not a fair comparison. Please clarify.
AR: Table 1 compares our proposed method to four learning curve predictors (including two heuristic rules, BSV and LSV, and two advanced methods, BGRN and CL), and three transferability metrics, NCE, LEEP, and LogME. The conceptual differences between these approaches are discussed in Section "Comparison with other approaches" from P8 to P9: ". . . our approach has access to some observations collected from early training, and therefore our prediction mechanism is more similar to learning curve prediction than those TM-based approaches which are designed as a surrogate of the transferability without fine-tuning or re-training." In terms of fairness, the proposed metric β eff, BSV, and LSV do not directly access the training labels. BSV uses the best observed validation accuracy value, while LSV uses the last observed validation accuracy value as a prediction of the final accuracy [2,8].
RC: Also please provide a conceptual comparison between the proposed method and previous methods in Table 1.

AR:
We have mentioned the conceptual comparison between the proposed method and previous methods in Section "Comparison with other approaches" (page 9). The key points can be summarized as follows: two families of previous predictors are considered in our comparison analysis, transferability measures (TMs) and learning-curve-based predictors (LCPs). TMs are proposed to quantitatively estimate how easy it is to transfer knowledge learned from a source task to a target task. The idea of LCPs is to extrapolate the partial learning curve using a combination of continuously increasing basic functions. Our approach also accesses partial observations of the learning curve collected from early training, and therefore it is more similar to LCPs than to TMs.
RC: Other points: On page 3, the authors mention that a network reduction approach is used to decouple the network system; how does it affect the gradient flow in the reduced network? From Theorem 1 it is clear that β eff depends not only on the network weights but also on the backpropagated gradients.
AR: The universal network reduction approach (GBB reduction) [1] has been widely used in many real-world networks [7,6,4,5]. One of the core techniques is mean-field theory. In short, the interactions between a node and all other nodes can be approximated by the interaction between the node and a "virtual" super-node that represents all other nodes. The N-dimensional states of the nodes in the system can be captured by a 1D effective state x_eff satisfying f(x_eff) + β_eff g(x_eff, x_eff) = 0, and the equilibrium states of the system can therefore be solved from f(x*_i) + (P1)_i g(x*_i, x_eff) = 0. The metric β eff measures the resilience of the network, and it depends on the network weights as well as the gradients.
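For concreteness, a minimal numeric sketch of this mean-field reduction, assuming the standard GBB definitions of x_eff and β_eff for a weighted adjacency matrix P (the paper's Eq. (2) is assumed to take the same form; this is an illustration, not the released code):

```python
import numpy as np

def effective_state(P, x):
    # Mean-field effective state x_eff: node activities weighted by outgoing strengths.
    one = np.ones(len(x))
    return float(one @ P @ x) / float(one @ P @ one)

def beta_eff(P):
    # Effective interaction strength: beta_eff = (s_out . s_in) / (1 . s_in),
    # with in-strengths s_in = P @ 1 (row sums) and out-strengths s_out = P.T @ 1.
    one = np.ones(P.shape[0])
    s_in, s_out = P @ one, P.T @ one
    return float(s_out @ s_in) / float(one @ s_in)

# Toy weighted, directed network standing in for the line graph G_B.
P = np.array([[0.0, 0.8, 0.1],
              [0.3, 0.0, 0.5],
              [0.2, 0.4, 0.0]])
x = np.array([0.9, 0.4, 0.7])
print(effective_state(P, x), beta_eff(P))
```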
The reduction approach derives a powerful 1D metric to measure the global property of the network, but it does not change the gradient flow in the reduced network at all.
RC: One of the key steps on the methodology side is to linearize the dynamics (Eqn. 7). How does it affect the conclusion? Is the linearized dynamics a sufficient characterization of the original nonlinear dynamics?
AR: Linearizing the dynamics allows us to analyze the behavior of neural networks in a more straightforward way. The motivation for linearizing the dynamics is to reformulate the training dynamics in the same form as the general dynamics described in Eqn. 1, and to analyze the behavior of neural network training. The general dynamics is characterized by three components: a self-driving force f(•), an external driving force g(•, •), and an adjacency matrix P (Eqn. 1). To obtain the adjacency matrix P, we decouple the nonlinear gradient by applying the linearization at W*.
The linearized dynamics is a sufficient characterization of the original nonlinear dynamics: dW^(ℓ)/dt is a function of W^(ℓ+1) rather than an approximation of it, and W^(ℓ+1) appears as an explicit term in Eqn. 8.
The derivation comes from the property of backpropagation, and there is no approximation. As described in Eqs. 8 and 9, W^(ℓ+1) has a direct impact on the gradient, and the other weights in higher layers also affect it, implicitly and indirectly, via W^(ℓ+1). When doing backpropagation, the weights of the higher layers are frozen and their impacts are propagated backward to W^(ℓ+1) when updating the weights W^(ℓ). These indirect impacts on the gradient are fully considered in our analysis.

RC:
In Theorem 1: what do you mean by "G A converges"? Do you mean training G A until it hits a local optimum of the loss function, or until it fits all the labelled data? I checked the supplementary materials (S4) and it seems that you use the condition that the gradient is zero. What happens in the SGD case, where the gradient is never zero?
AR: The convergence of G A is established on zero gradients, a practice commonly employed in analyzing complex systems. The underlying dynamical system, as defined in Eqn. 1, attains convergence when the entire system reaches an equilibrium state. This equilibrium state is directly associated with the weight values of the neural network, where the gradients of these weights approach zero. Even though the condition of exactly zero gradients may never occur, this simplification is still effective for the theoretical analysis.
AR: We apologize for any missing references in the manuscript. This is due to the use of cross-references between the main and supplementary materials.
Corrections in the revised manuscript: • S3, 1st line: The right-hand side (RHS) of Eq. (5) is a function of . . .

• S3, 3rd to the last line: The system can be viewed as a realization of the general Eq. (1), with linear . . .

Reviewer #2
RC: Overall, this is an interesting paper. Below, I provide some constructive feedback that should help improve the quality of the manuscript.
Key Results: this paper proposes to use the network resilience metric β eff as a proxy of the model accuracy. More precisely, to estimate the accuracy of a neural network, the proposed method works as follows:
• First, replace the classification head with a randomly initialized neural capacitance probe.
• Then, train the model for several epochs on the target dataset with the weights in the probe frozen. For each epoch, compute the resilience metric β eff based on the second-order gradient of the weights and collect the validation accuracies.
• Finally, fit a linear model with input β eff and output the collected validation accuracy. The bias of the linear model is the predicted accuracy.
The authors also provide a proof that links the proposed metric and the convergence of deep networks. The predicted accuracies on the selected models appear to be highly correlated with the actual accuracies.
AR: We appreciate the reviewer's thoughtful summary of our key results and are pleased that the work is perceived as interesting. We have carefully considered your constructive comments and made the required revisions accordingly. We believe that the updated manuscript aligns with the standards for publication in Nature Communications.
RC: Validity: The authors show results on vision tasks with some CNN models. However, the number of sampled neural networks is too small to support their main claims. Standard NAS benchmarks should be used for evaluation.
AR: We appreciate the reviewer's comments regarding the paper's validity. We agree with the suggestion to evaluate our approach on standard NAS benchmarks. In response, we have incorporated additional experiments using NAS-Bench-201, showcasing the improved performance of our approach. We measure and compare the ranking quality, in terms of Spearman's ranking correlation ρ, of our approach against the baseline models BSV and LSV. It shows that ours can achieve ρ = 0.76, better than BSV's ρ = 0.68 and LSV's ρ = 0.68.
AR: We thank the reviewer for bringing our attention to these related papers. We cited the suggested papers in our revised manuscript (see references 46-51) and added in Section "Comparison with other approaches":

". . . and many of them are training-free metrics for assessing the performance of neural networks."
Given the similarities between our approach and learning curve-based methods, it would be fairer to compare our proposed metric against learning curve-based prediction approaches than against these training-free metrics. However, in response to the reviewer's highlighted reference [3], we compare our metric with the proposed one. The results of this comparison demonstrate the superiority of our approach: the training-free method ZiCo yields a Spearman's ranking correlation of ρ = 0.59, significantly lower than our ρ = 0.76.
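The ranking quality reported above can be computed directly from the predicted and true accuracies; a minimal sketch with illustrative numbers:

```python
import numpy as np
from scipy.stats import spearmanr

# Final test accuracies of the sampled architectures and the accuracies predicted from
# early-training observations (illustrative numbers, not the NAS-Bench-201 results).
true_acc = np.array([0.71, 0.88, 0.64, 0.92, 0.79])
pred_acc = np.array([0.68, 0.85, 0.70, 0.90, 0.77])

rho, pval = spearmanr(pred_acc, true_acc)
print(f"Spearman's rho = {rho:.2f} (p = {pval:.3f})")
```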
RC: Data and methodology: Overall, the metric proposed in this paper is novel. I suggest the following to further strengthen the paper: The authors should give some toy example(s) to illustrate how exactly their proposed metric gets computed step by step. Currently, Fig. 1 only delivers the high-level idea, so more details are needed.
AR: We are pleased that the reviewer recognizes the novelty of our approach, and we thank the reviewer for the suggestion to strengthen our paper. While Fig. 1 in the manuscript shows the high-level idea of our approach, we have also shown in Algorithm 1 the specific steps (e.g., Steps 4 and 5) on how to compute the proposed metric. See the first paragraph that immediately follows Algorithm 1: For an MLP G A, it is possible to derive an analytical form of β eff. However, it becomes extremely complicated for a deep neural network with multiple convolutional layers. To realize β eff for deep neural networks in any form, we take advantage of the automatic differentiation implemented in TensorFlow.
Suppose we now want to identify the interaction strength P ij between weights W i and W j as shown in Fig. 1(a). We apply the advanced automatic differentiation in TensorFlow to derive the second-order gradient over the training set (Step 4), and then apply Eq. 2 to compute the proposed metric (Step 5). We are confident that presenting this breakdown will aid the reviewer in gaining a thorough understanding of the computation process underlying the proposed metric.
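Below is a minimal sketch of Steps 4 and 5 under the simplifying assumption that the interaction matrix P over the NCP weights is built from the absolute second-order gradient of the training loss, and that β eff then follows the standard mean-field reduction; the paper's exact Eqs. (2) and (7) may differ in detail, and the function name is illustrative.

```python
import tensorflow as tf

def compute_beta_eff(model, ncp_weights, x_batch, y_batch, loss_fn):
    # Step 4 (sketch): second-order gradient of the training loss over the NCP weights,
    # obtained with nested gradient tapes; its entries stand in for the interaction
    # strengths P_ij between pairs of weights.
    with tf.GradientTape() as outer:
        with tf.GradientTape() as inner:
            loss = loss_fn(y_batch, model(x_batch, training=True))
        grads = inner.gradient(loss, ncp_weights)
        flat_grad = tf.concat([tf.reshape(g, [-1]) for g in grads], axis=0)
    n = int(flat_grad.shape[0])
    blocks = outer.jacobian(flat_grad, ncp_weights)              # second-order gradients
    P = tf.abs(tf.concat([tf.reshape(b, [n, -1]) for b in blocks], axis=1))

    # Step 5 (sketch): mean-field reduction of P to the scalar beta_eff,
    # beta_eff = (s_out . s_in) / (1 . s_in) with s_in = P @ 1 and s_out = P^T @ 1.
    one = tf.ones([n], dtype=P.dtype)
    s_in = tf.linalg.matvec(P, one)
    s_out = tf.linalg.matvec(P, one, transpose_a=True)
    return float(tf.tensordot(s_out, s_in, axes=1) / tf.tensordot(one, s_in, axes=1))
```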
RC: Although it's left for future work, I think it is necessary to use standard benchmarks for neural architecture search (NAS), e.g., NASBench201, for results reproducibility and easier comparison with other approaches.
AR: We appreciate the reviewer's suggestion to evaluate our approach on standard benchmarks for neural architecture search (NAS). In response, we randomly sampled 100 neural networks from NAS-Bench-201 and assessed the ranking quality using our proposed metric, along with two learning curve prediction approaches, LSV and BSV. The figures below demonstrate that our approach exhibits superior ranking quality compared to the baseline methods.
AR: We appreciate the reviewer's suggestion and have incorporated the GPU-hours spent for accuracy prediction, as follows:
RC: Comparison with more related work is necessary. For example, reference [3] above provides a comprehensive survey of existing works on zero-shot NAS, which is referred to as TM in the paper.
AR: We thank the reviewer for providing reference [3]; we applied it to NAS-Bench-201 and compared it with our approach. It is shown that our approach outperformed the proposed TM metric in [3]. As shown in Figure 3 and Figure 6, our approach's ranking quality (ρ = 0.76) is much better than that of the suggested metric in reference [3] (ρ = 0.59).
RC: Analytical approach: the reason why a high network resilience may indicate a high validation accuracy is not clearly specified. Also, the proposed metric is only proved to indicate that the network achieves the optimal point. It would be better if the authors showed some theoretical results w.r.t. the convergence rate before reaching the optimum. Finally, the relationship between the generalization capacity and the proposed metric could also make this paper even stronger.
AR: We appreciate the reviewer's suggestion to include more theoretical findings. However, the network conversion from G A to G B introduces some non-smooth operations, and the definition of the weighted adjacency matrix is built on the gradients. These conversions introduce complexity into the derivation of the convergence rate before reaching the optimum. Currently, we are still diligently exploring a feasible approach to establish the relationship between the generalization capacity and the proposed metric.

RC: Suggested improvements
• Use publicly available NAS benchmarks for evaluation, particularly large datasets like ImageNet, Places365, and provide direct comparisons with existing approaches (SOTA).
• Provide results on more diverse CNNs. Currently, the only difference among the selected CNNs is their depth and basic blocks. The authors should consider more diverse CNNs, e.g., Wide-ResNet.
• Report GPU-hours as an estimation of the computational cost.
• Provide more details on the relationship between resilience and accuracy.
• (Minor) Can the authors show some results beyond vision, such as language tasks?
AR: We appreciate the detailed suggestions by the reviewer. We have added the required materials in the revised manuscript to address the comments. Here we simplify the responses with some key pointers to our revision.
Regarding comparison with publicly available NAS benchmarks: we added additional experiments on NAS-Bench-201 to compare our approach with the baselines LSV and BSV, as well as one additional training-free metric, ZiCo. It is found that our approach performs best on the benchmark dataset.
Regarding comparison with more diverse CNN models: we are grateful to the reviewer for bringing this to our attention. The pre-trained models reported in the manuscript encompass a broad spectrum, incorporating some of the most widely recognized neural networks. Upon promptly examining the recommended Wide-ResNet, we observed that the sole distinction between Wide-ResNet-50-2 and ResNet-50 lies in the last block, one featuring 2048-1024-2048 channels while the other has 2048-512-2048 channels. Given the existing array of ResNet variations in terms of depth and width, we opted not to replicate the results for Wide-ResNet.
Regarding GPU hours: we thank the reviewer for requesting the GPU hours, so we repeated the experiment and reported the GPU hours in Table 1.
Regarding the relationship between resilience and accuracy: we appreciate the reviewer's suggestion to expand our theoretical findings regarding the connection between resilience and accuracy. The current theorem (Theorem 1) provides foundational insights into these relationships. However, due to the inherently non-smooth nature of the operations involved in the conversion from a neural network G A to a line graph G B, we acknowledge the need for further refinement. We are actively exploring more effective approaches to enhance our understanding of the intricate relationship between resilience and accuracy, aiming to surpass the limitations of our current theoretical framework.
Regarding results on language tasks: our approach mainly concentrates on vision tasks, and the complexity involved in adapting our implementation to address language tasks warrants careful consideration. As a result, we intend to explore this aspect as part of our future work.
RC: Clarity and context: Comparison with related works is necessary. Also, the authors need to improve the clarity of the proposed method. In particular, the construction of the metric (Eq. 2) shows that the metric is determined by the adjacency matrix P. As shown in Eq. 1, g ij is the weight for the edge, while the adjacency matrix P only contains 0 or 1. However, this is not the case in Eq. 2, as the authors claim that the graph is a weighted graph.
AR: We appreciate the reviewer's suggestion to improve the clarity of our approach. It is possible that the reviewer misunderstood the meaning of g ij in Eqn. 1, which does not represent the weight of the edge but describes how node j interacts with node i.
The adjacency matrix P can be either a binary or a weighted matrix. In our case, the adjacency matrix is a weighted one, and the corresponding g ij for the training dynamics is w_j − w*_j.
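Putting these two clarifications together, a minimal worked form of the edge dynamics, written as an instance of the general Eq. (1) (an illustration consistent with the response, not the paper's exact Eq. (7)):

```latex
\frac{dw_i}{dt} \;=\; f(w_i) + \sum_{j} P_{ij}\, g(w_i, w_j),
\qquad g(w_i, w_j) = w_j - w_j^{*} .
```

In words, each entry P_ij weights how far w_j is from its optimum when driving w_i, and the interaction term vanishes as the weights converge to W*.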

REVIEWER COMMENTS
Reviewer #1 (Remarks to the Author): Thanks for the authors' feedback. I really appreciate it.
The explanation of Bayesian ridge regression helps a lot. Now I understand how the regression is done (by collecting samples from the early stage of training).
I have follow-up questions.
1. The authors say that the three-layer neural capacitance probe (NCP) unit network is to compute a surrogate for \beta_eff, which is computationally prohibitive. How did you train this NCP network? Did you use the ground truth \beta_eff as a supervised signal? There are two possibilities: a) If you use the ground truth signal of \beta_eff as the supervision, then is it still computationally costly to train the NCP? How long will it take to do the training? Please report.
b) If you choose to train the NCP directly with the final accuracy (e.g., using the output of the NCP for ridge regression and doing end-to-end backpropagation), then it is not clear whether the output of the NCP is actually connected to \beta_eff. Please clarify.

2. When I asked about overfitting, I do not argue that the model will overfit on the real-world dataset (actually, overfitting does not happen, as shown in the additional experiments). Instead, I want to see a scenario in which the model overfits and has a very low validation accuracy, and your approach can still predict the validation accuracy very well. Otherwise the proposed method is not tested beyond the common training region and I am not convinced. To make the model overfit, we could just use a small subset of the training set and see whether the proposed approach gives a good prediction of the final accuracy.
3. Thanks to the authors for providing the background of the baseline methods that also predict the final performance given the early-stage training. There may exist other straightforward baselines, e.g., predicting the validation accuracy from the name of the dataset, or from training hyperparameters (e.g., lr, length of epochs, batch size, #parameters, etc.) that do not depend on the actual training dynamics (e.g., the changing weights of the networks). We would like to see whether \beta_eff actually makes a difference here, which is the main contribution of the paper.
Reviewer #1 (Remarks on code availability): See the comment above.
Reviewer #2 (Remarks to the Author): This reviewer appreciates the authors' effort to revise the initial manuscript and address individual comments. While this revised version addresses some of the initial concerns, a few important ones need further attention: 1. Comparison with SOTA - First, the comparison against ZiCo needs to provide all the details of the setup used; without these details, it is impossible to understand the significance of the comparison and enable reproducibility of the results.
- Second, comparison against ZiCo only is not enough; comparisons against other approaches are needed to see the consistency of the results. At the very minimum, comparisons against Zen-NAS and NASWOT (refs [1] and [2] in Reviewer #2's initial comment) are needed, with all details given in the Supplementary, to understand how this approach fares compared to existing approaches.
- Third, the authors' comment about the superiority of their approach compared to ZiCo needs to be rephrased, since their approach involves training and hence it is expected to provide better accuracy compared to zero-shot approaches. On the other side, zero-shot approaches have their inherent advantages, so it is hard to talk about superiority in such a context.

Including the authors' relevant responses in the main manuscript/Supplementary
Some of the author responses to this reviewer's questions need to be included in the main manuscript and/or Supplementary, as appropriate; this is critical for improving the paper's readability and overall contribution. As of now, very little material from the authors' responses is actually included in the revised version of the initial paper. For instance, the comparisons against SOTA required above, as well as the plots of Figs. 4-6 discussed in the authors' response, need to be included in the Supplementary. Same about Table 1 (GPU times). On the other hand, the authors' comment regarding the relationship between resilience and accuracy should be mentioned in the main paper as a current limitation.
AR: For 1), we randomly initialize the NCP and freeze it during the fine-tuning of the neural network. Since the NCP remains frozen, the fine-tuning process does not involve any training for the NCP. For 2), the NCP consists of only 3 layers, which is much smaller than the entire neural network. As highlighted in the main manuscript (see the last two paragraphs of "Neural capacitance" in Methods), computing β eff for the entire network is prohibitively expensive, so we seek to compute β eff with the NCP.
RC: When I asked about overfitting, I do not argue that the model will overfit on the real-world dataset (actually, overfitting does not happen, as shown in the additional experiments). Instead, I want to see a scenario in which the model overfits and has a very low validation accuracy, and your approach can still predict the validation accuracy very well. Otherwise the proposed method is not tested beyond the common training region and I am not convinced. To make the model overfit, we could just use a small subset of the training set and see whether the proposed approach gives a good prediction of the final accuracy.
AR: We test the performance using about 0.5% of the original training data (the batch size is 64 and the number of minibatches is 4).
RC: Thanks to the authors for providing the background of the baseline methods that also predict the final performance given the early-stage training. There may exist other straightforward baselines, e.g., predicting the validation accuracy from the name of the dataset, or from training hyperparameters (e.g., lr, length of epochs, batch size, #parameters, etc.) that do not depend on the actual training dynamics (e.g., the changing weights of the networks). We would like to see whether β eff actually makes a difference here, which is the main contribution of the paper.

AR:
We agree that some auxiliary information about the network and dataset can be used as a prediction. Since in our experiments we use the same number of total epochs and the same batch size for all models on each dataset, these two auxiliary features will not be useful for performance prediction. On the other hand, the reviewer is correct that the number of parameters is commonly used as a competitive training-free baseline for predicting model performance [2]. We compared our method with the total number of parameters as a predictor. The same architectures in NAS-Bench-201 are used, and we compute the Spearman correlation ρ between the number of parameters (i.e., model size) and the test accuracy. We find that our method has ρ = 0.76, while the number of parameters as a metric only has ρ = 0.52. It is worth mentioning that the comparison is not a fair one, as our method requires training but the number of parameters does not.

Reviewer #2
RC: This reviewer appreciates the authors' effort to revise the initial manuscript and address individual comments. While this revised version addresses some of the initial concerns, a few important ones need further attention:
• RC: Second, comparison against ZiCo only is not enough; comparisons against other approaches are needed to see the consistency of the results. At the very minimum, comparisons against Zen-NAS and NASWOT (refs [1] and [2] in Reviewer #2's initial comment) are needed, with all details given in the Supplementary, to understand how this approach fares compared to existing approaches.
AR: The hyperparameters are the same as reported in the references. For the NASWOT score, we use a batch size of 128 and 1 minibatch. For the ZenNAS score, we use a batch size of 16 and 32 minibatches. Following the suggestion, we included new comparisons to ZiCo, NASWOT, and ZenNAS in the revised Supplementary Information section COMPARISON WITH STANDARD NAS BENCHMARKS. The correlations between the NAS metric and the model performance are computed using randomly sampled architectures. Using a subset of architectures in the NAS-Bench-201 search space, the Spearman correlation ρ for ZiCo, NASWOT, and ZenNAS is 0.59, 0.58, and -0.04, respectively. Our method has a correlation of 0.76 in comparison. The correlation for all architectures in the NAS-Bench-201 search space is 0.81 (ZiCo), 0.77 (NASWOT), and 0.35 (ZenNAS). The correlation for NASWOT and ZenNAS is reported in [2], while the correlation for ZiCo is reported in [3].
• RC: Third, the authors' comment about the superiority of their approach compared to ZiCo needs to be rephrased, since their approach involves training and hence it is expected to provide better accuracy compared to zero-shot approaches. On the other side, zero-shot approaches have their inherent advantages, so it is hard to talk about superiority in such a context.
AR: We fully agree with Reviewer #2 that it is hard to talk about superiority in such a context. We rephrase our statement regarding the comparison of our method with training-free NAS methods, as listed below. The comparison is included in the updated supplementary materials.
"Our method shows a higher correlation on NAS-Bench-201 compared to ZiCo and NASWOT.It is worth mentioning that direct comparison solely on the correlation is unfair since our method is not training-free." RC: Including authors relevant responses in the main manuscript/Supplementary.
AR: Based on both reviewers' comments on our last version, we did supplementary experiments and put them in the SI. We thank Reviewer #2 for this great suggestion. We have incorporated these new results and additional details in the SI. Specifically, here are the main places where we made significant changes: • We include the computational cost in the Supplementary Information. The section title is COMPUTATIONAL COST OF THE PROPOSED FRAMEWORK.
• We report the ranking quality on NAS-Bench-201 using different numbers of randomly sampled architectures. Note that a smaller set of architectures is a subset of the larger set; in other words, when increasing the total number of architectures, we include the originally selected set. We find that the ranking correlation increases as we increase the number of randomly sampled architectures.
We learn the implicit relation between the validation accuracy I and β eff with the Bayesian ridge regression from the early training epochs. The relation is essentially a nonlinear function I − h(β eff; θ) = 0. It describes the training trajectory from a new perspective and is expected to be consistent throughout the training process. As the model converges, β eff approaches zero (Theorem 1), so we have I* − h(0; θ) = 0. We can then directly derive from the nonlinear function a validation accuracy I* = h(0; θ), which is exactly the final accuracy. The main purpose is to obtain the final accuracy with fewer training epochs.

Figure 1: Application of Bayesian ridge regression to accuracy prediction.
As an addition to the above notations and Fig. 1(c) in the main manuscript, we included an extra diagram on the right side to illustrate how we apply the Bayesian ridge regression in predicting the final accuracy. The orange dots are observations collected from the early training stage and used to fit the relation between the training β eff,k values and the validation accuracies I k, and then to extrapolate the final accuracy as β eff → 0, which we expect to hold once the model has converged. Note that some of the very early epochs may be noisy and are discarded from the regression. The number of observations and the start epoch of the observation window are determined by the Bayesian information criterion (BIC). See the last sentence of subsection "Neural network model selection with the neural capacitance β eff" in the main manuscript for details.

Figure 3: Training top-1 error (circles) versus validation top-1 error (squares) for different batch sizes: 16, 32, and 64. Left: ResNet18; right: ResNet34.

Algorithm 1: Implement NCP and Compute β eff

Figure 4: Our approach.

Figure 6: LSV method.
RC: GPU-hour spent for estimating the accuracy is necessary for comparison with existing works. This should include the time spent on model warmup epochs (training epochs before the starting epoch t 0), data collection epochs, and metric computation.

Figure 1: Schematic illustration of Bayesian regression.

If the reviewer's mention of 'training' refers to the Bayesian regression model, the NCP's role is to collect the encoded signals from the bottom layers of the entire neural network, enabling the computation of the observations, i.e., β eff and the validation accuracy. These observations are then utilized by the Bayesian regression model to learn the implicit relationship between β eff and the validation accuracy. The ultimate goal is to apply Theorem 1 to this Bayesian regression model, extrapolating the final accuracy. Obtaining the final accuracy of the neural network without our framework typically requires training the model until it converges; our framework saves this extra work. Despite the absence of any training on the NCP, we can still predict the neural network's final accuracy based on our framework. We hope the above explanation clarifies the relations between the NCP and β eff, and between β eff and the final accuracy. In the main paper, we illustrate the proposed algorithm as shown below:

Algorithm 1: Implement NCP and Compute β eff
Input: Pre-trained source model F_s = {F_s^(1), F_s^(2)} with bottom layers F_s^(1) and output layer F_s^(2), target dataset D_t, maximum epoch T
1: Remove F_s^(2) from F_s and add on top of F_s^(1) an NCP unit U with multiple layers (Fig. 1b)
2: Randomly initialize and freeze U
3: Train target model F_t = {F_s^(1), U} by fine-tuning F_s^(1) on D_t for T epochs
4: Obtain P from U according to Eq. (7)
5: Compute β eff with P according to Eq. (2)

Figure 2: Ranking correlation between the predicted performance and the real performance for models in the overfitting regime.

Figure 4: Spearman correlation of predictors and test accuracy. Left: our method. Middle: BSV method. Right: LSV method.

Figure 1: Spearman correlation ρ of the proposed method on the NAS-Bench-201 benchmark using different numbers of randomly sampled architectures.

Table 1: Running time in GPU hours for our framework in estimating the accuracy. It involves two steps: 1) transfer learning to freeze the NCP unit and then fine-tune the pre-trained model until it converges; 2) compute the per-epoch β eff values.
The subset of the training data is randomly sampled, and each training epoch uses the same subset of the training dataset. We use exactly the same hyperparameters to fine-tune the model, compute β eff, and predict the model performance. Accuracy is predicted on the CIFAR-10 dataset for ResNet50, ResNet34, ResNet18, ResNet152, ResNet101, DenseNet201, DenseNet169, DenseNet161, and DenseNet121. The average test accuracy is 62.08 ± 8.25, while the average train accuracy is 79.78 ± 11.63. We compute the Spearman correlation ρ between the test accuracy and the predicted validation accuracy. The correlation is shown in Figure 2. Due to the overfitting, the model performance is seriously degraded, but our proposed method ranks model performance reasonably well.
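As a minimal sketch of how such a fixed small training subset can be constructed (assuming CIFAR-10 and the tf.data API; the sampling code itself is an assumption, not taken from the paper):

```python
import numpy as np
import tensorflow as tf

# About 0.5% of CIFAR-10's 50,000 training images: 4 minibatches of 64 examples.
batch_size, num_minibatches = 64, 4
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()

# Draw the subset once and reuse the same 256 images at every epoch to induce overfitting.
rng = np.random.default_rng(0)
idx = rng.choice(len(x_train), size=batch_size * num_minibatches, replace=False)
small_train = (tf.data.Dataset
               .from_tensor_slices((x_train[idx].astype("float32") / 255.0, y_train[idx]))
               .batch(batch_size))
# model.fit(small_train, validation_data=..., epochs=...) then proceeds as in the main setup.
```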